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Abstract 



0^ We analyse sequential Markov coalescent algorithms for populations with demographic structure: for a bottleneck model, a 
population-divergence model, and for a two-island model with migration. The sequential Markov coalescent method is an ap- 
proximation to the coalescent suggested by McVean and Cardin, and Marjoram and Wall. Within this algorithm we compute, for 
two individuals randomly sampled from the population, the correlation between times to the most recent common ancestor and the 
linkage probability corresponding to two different loci with recombination rate R between them. We find that the sequential Markov 

L- 5 coalescent method approximates the coalescent well in general in models with demographic structure. An exception is the case 
where individuals are sampled from populations separated by reduced gene flow. In this situation, the gene-history correlations 

t— i may be significantly underestimated. We explain why this is the case. 
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1. Introduction 



Fast sequencing and genotyping techniques have facilitated 
Q-fthe collection of genome-wide data sets of genetic varia 



tion i n several organisms, including humans (Altshu ler et al. 



2000), maize (Yu and Buckler 



2005 1: iThe International HapMap Consortiumi 2007 ), mic e 



20061) 



(ILindblad-Toh et al 
* aid th e flowering plant Arabidopsis thaliana dNordborg et al 
\Q 120051) . The empirically observed patterns of genetic variation 
are shaped by the genetic history of the population in question 
which in turn is determined by geographic, historical, and eco- 
logical factors. Together with mutations and linkage disequilib- 
rium, these factors give rise to the observed patterns. 

Today the most common theoretical tool for analysing em- 
irical data of genetic va r iation is the coalesce nt algorithm 



o 
o 



. £ feingmanl 119821: iGriffithsl I198U iHudsonl 119831) . In its sim- 
plest form, the coalescent algorithm assumes a randomly 
$-H mixing neutral population of constant size. The algorithm 



has been extended to include intragenic recombination, gene 
conversion, population structure (both geographic and demo- 
graphic) and temporal variations of the population size, such 
as po pulation expansions and bottlenecks (see, e.g., Nordborg, 
1200 ll for a review). Last but not least, coalescent theory is 
now also routinely used in studying the effe cts of selection 
( The International HapMap Consortiumi l2007t) . 

Many sophisticated methods have been developed to ana- 
lyze empirical genetic data, often based on sampling gene ge- 
nealogies by means of coalescent theory under different popula- 
tion models ( Marjoram and Tavare , 20061 Liang and Kelemerl 
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2008). In order to arrive at accurate inferences with these meth- 



ods it is essential to sample the genealogies which are most 
likely to underly the observed data. As a consequence, the anal- 
ysis ty pically requires very large nu mbers of samples to be gen- 
erated (Marjoram and Tavarell2006l) . 

Experience shows that the standard implementations of the 
coalescent algorithm are very efficient for short genomic re- 
gions (of lengths of the order of a few hundred kb to a few 
Mb). For long genomic sequences, by contrast, little computa- 
tional efficiency is gained by skipping over uninteresting gener- 
ations, and substantial computational effort is expended track- 
ing recombination events, their positions, and the many ances- 
tral fragments of each sequence as they repeatedly recombine 
and coalesce with each other. As a consequence, the compu- 
tational time increases rapidly with the size of the region of 
interest, and with the sample size. 

Several authors have proposed solutions to this problem: 



Liang et alJ (120071) have introduced a method for efficient sim- 



erikssonOchalmers . se (A. Eriksson) 



ulation of many short sequences spread over a long chromo- 
some. Rather than skipping over all generations where the an- 
cestral lines remain unchanged, genealogies are traced gener- 
ation by generation back through time, and several coalescent 
and recombination events are allowed to happen simultaneously 
in each generation. This leads to a significant improvement in 
performance over standard simulation packages. However, this 
method does not simulate a contiguous locus, but rather iso- 
lated loci spread over a large genomic region. Because the work 
necessary to simulate gene genealogies of large contiguous ge- 
nomic regions is ultimately bound by the number of genealog- 
ical events during the h istory of the sampl e, it is not clear how 
efficient the method of iLiang et alJ (12007b would be if it were 
applied to large contiguous genomic regions. 
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The sequential M arkov coalescent (SMC) method of 
McVean and Cardin (l2005h . and the related method of 
Marj oram and Wail ( 2006 ). called SMC method, are approx- 
imations to the coalescent algorithm. By construction, these 
methods result in the same marginal (single-locus) gene ge- 
nealogies as the coalescent, but correlations between differ- 
ent loci are only approximately represented. The advantage of 
the SMC and SMC methods is that they are much more effi- 
cient than the standard coalescent algorithm, in particular when 
genome-wide gene genealogies must be sampled. 

In the simplest case of a neutral, structureless population 
with constant population size, the SMC and SMC algorithms, 
while approximate, accurately reproduce the dependence of 
linkage on the recombination rate between two loci on the same 
chrom osome dMcVean and Cardinl l2005t iMarioram and W all. 
20061) . 

In this article we pose the question: what happens when the 
population size is not constant, or when the population exhibits 
demographic structure? To answer this question, we investi- 
gate three common alternatives to a neutral, structureless pop- 
ulation with constant population size: (a) a population bottle- 
neck model where the population size is taken to be piecewise 
constant, (b) a population-divergence model, and (c) a spatially 
structured model where the population is divided into two sub- 
populations connected by gene flow. For these three models we 
investigate the performance of the SMC algorithm. (Because 
the SMC and the SMC methods are so similar in their construc- 
tion, and the SMC results are found to be significantly closer 
to the exact results of coalescent theory, we focus on SMC al- 
gorithm in this article.) 

The remainder of this article is organised as follows: in 
Sec. |2] we review the coalescent algorithm, and briefly describe 
the SMC and SMC methods. In Sec. [3]we define the observ- 
ables studied in this article: the probability of linkage and the 
correlation function of gene histories. We discuss the relation 
between these two quantities. Further, we describe our imple- 
mentation of the SMC algorithm for the three models of pop- 
ulation structure described above. In Sec. |U we present our 
results and analysis. Finally, Sec. [5] concludes with a brief dis- 



2. Background 

2.1. The coalescent 

In this section we briefly describe the coalescent algorithm 
and its use in interpreting empirical data on genetic vari- 
ation. The coalescent algorithm has a wide range of ap- 
plications, including the inference of population history and 
structure from empirical observations of the genetic varia- 
tion of short stretches of a chromosome b y mean s of mo d- 
els of bottlenecks, popul a tion expan s ion dTaiima , 1989ba; 



hotspots, e.g. Wiuf and PosadaL 120031) . and of (empirically ob- 
served) long-range changes of the recom bination rate along 



the chromosomes in the h uman genome ( Ko ng et all [2002; 



els oi bottlenecks, population expansion ( lajimaj, iiysyfjya 
Slatkin and Hudsonl Il99lt ISano etalJ, )2004. gene-flow be 



Teshima and Taiima, 



tween sub-populations (Wakelev, 1 99_6p _ ... 

20031 IStumpf and Goldstein! l2003l)~ and complex speciation 
dPatterson et all 20061) . The coalescent algorithm has also 
been successfully employed to investigate the effect of short- 
range variation of the recombination rate (recombination 



Eriksson and Mehlig, 2005). Last but not least, a large num- 



ber of authors have studied th e effect of balancing selec- 
tionj Hudson and Kaplan! 19881). and of d i rectional selection 
(Kaplan et al., 1989; Krone and Neuhauser, 119971: IPrzeworski. 
2002i ICoop and GriffithsL l2004t lEriksson et all 120071) on the 
genetic variation of a neutral locus on the same chromosome 
as the selected locus. 

Standard implementations of the coalescent method sam- 
ple the sequen ce of genealogical events backwards in time 
((Hudson, Il990h . Usually, one is interested in the gene geneal- 
ogy of a relatively small sample of a large population. In this 
case, direct simulations of the Wright-Fisher model are inef- 
ficient (note, however, the recent advances in direct forward 
simulations which may include more co mplex models of se- 
lection and population dyn amics by, e.g., Hoggart et all 2007 



Carvaial-Rodriguea 120081) . Because all genetic variation is as- 



sumed to be selectively neutral, it is customary to first generate 
the gene genealogies of all loci and then simulate the mutation 
process in a second step, given a particular genealogy. 

The genealogical events referred to above include the coales- 
cence of two sequences into a single ancestral lineage, recom- 
bination with a the sequence, or migration between sub popu- 
lations (gene flow). Coalescent theory builds upon the assump- 
tion that these events occur infrequently. It is therefore assumed 
that such events never occur simultaneously, and that many gen- 
erations pass between consecutive events. The number of gen- 
erations between genealogical events is explicitly computed, 
and is used to skip over generations where no genealogical 
events pertaining to the sample occur. The algorithm proceeds 
until all loci have coalesced to their most recent common ances- 
tor in a given sample. The resulting genealogy is used to place 
mutation events according to an appropriate mutation model. It 
is usually assumed that mutations can be modeled by point pro- 
cesses introducing new genetic variation independently of the 
ancestral sequence, but more complex models have been stud- 
ied which are argued to more accurately reflect the emp irical 
distribution of nucleotide bases dArenas and Posada , 2007 ). 

In the absence of recombination within a contiguous section 
of a chromosome, all nucleotides in that section have the same 
genealogy, and this genealogy is a binary tree. Recombination 
breaks this association between neighbouring nucleotides be- 
cause the nucleotides to the left of the recombination point have 
a different parent than the nucleotides to the right, so that the 
gene genealogies are no longer the same. Recombination events 
are assumed to occur according to a Poisson process along 
the chromosome. The recombination rate is usually given by 
R = ANr where N is the population size and r is the probability 
of recombination between a pair of neighbouring nucleotides in 
a single generation. In this case, the joint gene genealogy of a 
contiguous segment of a chromosome can be represented as a 
graph, called the ancestral recombination graph. In Fig. Q] the 
ancestral recombination graph for a sample of two individuals 
is shown for a stretch of DNA represented by the interval [0,1]. 
Two recombination events occurred in the gene genealogy of 
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Figure 1 : Illustration of the ancestral recombination graph of a con tiguous locus 
for a sample of two individuals in the coalescent, after (Hudson, 1990). The 
locus is represented by the interval [0, 1], and G(x) denotes the genealogy at 
position x. Two recombination events in the ancestral recombination graph, the 
more recent one at xi = 0.8 and the older one at x\ = 0.3, partition the locus into 
three consecutive regions, < x < 0.3, 0.3 < x < 0.8, and 0.8 < x < 1, with 
a different gene genealogy in each region [denoted G(0), G(0.3), and G(0.8), 
respectively]. G(0.3) is the genealogy resulting from taking the right branch in 
the first recombination event and the left branch in the second recombination 
event in the ancestral recombination graph. Similarly, the genealogy G(0.8) 
corresponds to taking the rightmost branch in both recombination events. 



the sample, at positions xi = 0.3 and x 2 = 0.8 along the stretch 
[0, 1]. The recombination events separate the gene genealogies 
into three different groups. If G(x) denotes the gene genealogy 
of the nucleotides at position x of the chromosome, then G(x) is 
a piecewise constant function of x. It is constant on the intervals 
[0, Jti), [xi,x 2 ) and [x 2 , 1]. 

This concludes our brief description of coalescent methods. 
Tab.[T]lists a number of different implementations of the coales- 
cent algorithm. The first wide ly spread implementation was the 
ms program by ( Hudsonl 2002 ). and it is still one of the most 
widely used programs. In addition to the unstructured coales- 
cent model, this implementation also supports simple models of 
population expansion and allows for sub-populations connected 
by migration. The extension of the coalescent framework to in- 
clude different aspects of evolution is reflected in a large num- 
ber of simulation programs which have emerged over the last 
few years (Tab.[TJ. 



2.2. The sequential Markov coalescent method for an unstruc- 
tured population 

An important property of the coalescent process is that the 
marginal (single-locus) distribution of gene genealogies is in- 
dependent of the position of the locus and thus independent of 
the recombination model (correlations between genealogies at 



ms Hudson (2002) 

SimCoal 2.0 Laval and Excoffier (2004) 

SelSim Spencer and Coop ("20041 

CoaSim Mailund et al. (2005) 

CoSi Schaffner et al. (2005) 

SMC McVean and Cardin (2005) 

SMC ' Marjoram and Wall (2006) 

FreGene Hoggart et al. (2007) 

Recodon Arenas and Posada (2007) 

Genome Liang et al. (2007) 

msHOT Hellenthal and Stephens (2007) 

GenomePop Carvaial-Rodriguez (2008) 

Table 1: Simulation pa ckages for the c oalescent (the list is not complete). Note 
that the ms package of Hudson 1 2002) is much older than the publication date 
indicates. 



different loci, by contrast, depend on the recombinat ion model). 
Based on this observation. lWiuf and Heinldl999blal) introduced 
an equivalent formulation of the coalescent process where the 
local gene-genealogical tree is followed along the chromosome, 
rather than tracing the ancestry of all loci back through time, as 
in the coalescent with recombination. 

The procedure for generating the joint gene genealogies of a 
contiguous stretch of DNA is as follows: generate the genealog- 
ical tree corresponding to the left-most nucleotide of the stretch 
for all individuals in the sample. Let T be the total length of this 
tree (the sum of the lengths of all branches in the tree), mea- 
sured in generations. Generate the location of the next recom- 
bination event on the chromosome by moving an exponentially 
distributed number of nucleotides to the right. The expected 
value of the exponential distribution is {rTY 1 . If the location is 
beyond the right end of the stretch, the procedure terminates. If 
not, consider the genealogy for the nucleotides at the recom- 
bination location. Pick a point on this genealogy randomly 
uniformly. At this point (in time) the recombination event is 
assumed to occur. Trace a line starting from this point (the 
'recombination point') back through time, and allow it merge 
with the ancestral recombination graph G(x) with the standard 
coalescent rates. This new line is added to the ancestral recom- 
bination graph, the genealogy is updated, and correspondingly 
its total length T. In this way one continues to generate recom- 
bination events on the chromosome until the right end of the 
stretch is reached. 

The method of Wiuf and Hein results in the same gene ge- 
nealogies as the coalescent process. But it is also equally com- 
putationally demanding for large sample sizes and large ge- 
nomic regions. To address this problem, iMcVean an d Cardin 
(120051) sought an approximation to the method by Wiuf and 
Hein. As before, gene genealogies are constructed by moving 
along the chromosome (from left to right). When a recombina- 
tion event occurs, the gene genealogy is updated. This update is 
illustrated in Fig. [2^. As usual, the recombination event causes 
one of the lines in the ancestral recombination graph to detach 
from the graph. The part of the line above the point of detach- 
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Figure 2: Illustration of the SMC and SMC approximations for a sample of 
three individuals. The top row shows the line re-attachment process. The given 
tree is drawn in black, the resulting tree using red lines. The re-attached line is 
dashed. In the SMC method, the line to be re-attached cannot be joined to its 
old path, whereas it can do so in the SMC method. The bottom row shows the 
resulting gene genealogies after the re-attachment process. 



ment (referred to as the recombination point in the preceding 
paragraph) is deleted from the ancestral recombination graph 
(see Fig. |2]). Up to this step, the algorithm does not differ from 
the one proposed by Wiuf and Hein. The difference lies in the 
following simplification: instead of allowing the line to attach 
to any line of the ancestral recombination graph, it is only al- 
lowed to attach to lines in the local gene tree (see Fig. |2). It 
is thus no longer necessary to keep track of the full ancestral 
recombination graph. Hence, the work necessary to simulate a 
long contiguous stretch of DNA is significantly reduced com- 
pared to the coalescent process. The drawback of this simpli- 
fication is that gene histories decorrelate faster as a function of 
the distance along the chromosome as compared with the exact 
coalescent process. The reason for this deviation lies in restrict- 
ing the re-attachment process. 

An improvement over t he SMC method was suggested by 
Marj oram and Walll d200rj|) . In their algorithm (SMC method) 
the line separated by a recombination event is allowed to attach 
also to its old path (see Fig. While retaining the speed and 
memory efficiency of the SMC method, it was found to be sig- 
nificantly more accurate when compared to the exact coalescent 
process. 



3. Methods 



Th e SMC method in its original form dMarioram and Wali , 
20Q6h applies to unstructured populations with constant popu- 
lation size. In this section, we describe how the SMC method 
can be extended to models with population structure and chang- 
ing population size. Figs.[3^-c show the three models for which 
we compare the standard coalescent to the SMC method. 
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Figure 3: Scenarios of population history and structure, (a) Population bottle- 
neck model, (b) The population-divergence model, (c) Two-island model, (d) 
A more complex scenario. 



More complex models include combinations of population di- 
vergence and expansion, bottlenecks, and migration. As an ex- 
ample of such models, Fig. [3t l depicts the ou t -of-A frica sce- 
nario for human history used by lSchaffner et aL ( 20051). see also 



(Eriksson an d Mehlig, |2004|) . ISchaffner et all (|2005) imple 



ment a model of population structure, bottlenecks, and variable 
recombination rates along the chromosomes, tuned to match 
observed of allelic spectrums of different populations (African, 
Indo-European, Asian, and American). 

3.1. Bottleneck model 

The population bottleneck model is illustrated in Fig.FJh. The 
population was of constant size Ny until 2N(T + D) generations 
ago, when the population decreased to Nx individuals during 
2ND generations. After the bottleneck, the population quickly 
expanded to the present population size, N individuals. It is 
straightforward to extend the SMC algorithm to this model. 
First, a gene genealogy for the leftmost locus is generated us- 
ing the standard coalescent method for the model. Second, the 
position of the next recombination event is generated as in the 
SMC model. Third, the recombining line is re-attached us- 
ing the same procedure as in the SMC method, but where the 
coalescent rate now depends on time, reflecting the changing 
population size. 

3.2. Population divergence 

Fig. [3j) shows a model of population divergence. In this 
model, the population was of constant size of N until 2ND gen- 
erations ago, when it diverged into two parts with no gene flow 
between the two branches, which each grow to population size 
A'. The branches may correspond to different species, or may 
correspond to sub-populations of the same species separated by 
e.g. a geographical barrier. If two individuals are sampled from 
the same branch, the population size is effectively constant. We 
therefore only consider the case where the individuals are sam- 
pled from different branches. In this case, the SMC algorithm 
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Figure 4: A sample genealogy of genes for the migration model in the SMC 
algorithm. The black solid lines refer to the genealogy before recombination 
occurred, the red lines to the new genealogy. The re-attached line is drawn as a 
red dashed line. Also shown are the sequences of states, both according to the 
old and to the new genealogy (in black and red, respectively). 



is run separately for the two branches, from the present back 
to the time of the divergence. Some ancestral lines may merge 
during the divergence. The surviving ancestral lines are pooled 
in a single population, and the original SMC algorithm for con- 
stant population size is used to find the gene genealogies. 

3.3. Island model 

We con sider the standa rd island model of population struc- 
ture due to Wright d 1 93 lb . see Fig. [3}:. The population consists 
of two sub-populations, each with N individuals, connected by 
a gene flow corresponding to m migration events per generation. 
The scaled migration rate is M — 4Nm. 

The implementation of the SMC method in this model is 
more complicated than for the models of population bottlenecks 
and divergence shown in Fig. [3^ and b. For this reason, the 
discussion in this subsection is restricted to sample size two (the 
algorithms described above apply to arbitrary sample sizes). 

In contrast to the previous models, in the island model it is 
not sufficient to keep track of the total height of a genealogy 
in order to update the gene genealogies in the SMC algorithm. 
This is because the rate of attachment depends on the number of 
existing lines present in the island where the line to be attached 
resides. Hence, it is necessary to record also the sequence of 
migration events in the genealogy. 

The re-attachment procedure for this population model is 
implemented as follows. First, let n\ and ni be the num- 
ber of ancestral lines sub populations 1 and 2, respectively 
(n,- e {0, 1,2}). Note that the variables «,■ are piecewise constant 
functions of time (Fig. 0). When a recombination event oc- 
curs, the breaking time (referred to as the recombination point 
above) is chosen uniformly randomly on the tree corresponding 
to the location of the recombination event. The detached line 
(the dashed line in Fig. 0]i is re-attached to the old tree (black 
lines) applying standard rates of coalescence, and of migration 
between sub populations. 



3.4. Gene-history correlations and probability of linkage 

In order to compare the accuracy of the SMC approxima- 
tion to the coalescent method, we study the correlation p of the 
times to the most recent common ancestor for a sample of two 
individuals, at two loci separated by recombination rate R. This 
correlation underlies empirical measures of genetic variation, 
such as the variance of the number of segregating sites in a con- 
tiguous locus. Hence, if the models differ with respect to the 
correlation p, they ar e likely to differ also with respect to other 
observables (see, e.g. Hudson! 12001 1 for a review of properties 
of genetic variation in a sample which can be described by the 
two-locus statistics). 

In the some population models, the correlation p is found 
to be equal to the probability pi of linkage between the two 
loci in question. This is the case, for instance, in the constant 
population-size model and in the population-divergence model 
(but not in the population bottleneck and two-island models). 
In some cases (see below) we have been able to compute pi 
in closed form in the SMC approximation. The corresponding 
results are described in Sec. [4] but we first review the known 
results for the exact coalescent and for the S MC approxima tion. 

For the constant population-size model. Hudson! ( 1983 ) ob- 
tained the by now well-known exact result 



PL (R) = p(R) 



R+IB 



R 2 + 13R + 18 



(1) 



Within the SMC approximation, the p robability of linkage is 
found to be (McVean and Cardin, 2005) 



Pl(R) = 



1 



1 +R 



(2) 



Hence, in the simplest, constant population-size model, the ex- 
act probability of linkage and the one obtained within the SMC 
approximation are similar for small and large values of R. But 
they differ for intermediate values of R. 

In the population-divergence model, the probability of link- 
age and the correlat ion can be calculated exa ctly using the 
method described in (Eriksson and Mehlig, 2004). One finds: 



p L (R) = p(R) = 



4(R 2 +1R + 18) 



(R + 2) 2 (R 2 + 13R+ 18) 
8(g + 6)R e-°^ 2 + (R+ 10)R 2 e -°^ +2 > 
(R + 2) 2 (R 2 + 13R + 18) 



(3) 



It sh ould be noted that this r esult differs from Eq. (14) 
in jEriksson and MehligL 2004 ). The reason is that here 
we have conditioned on both individuals being in different 
sub populations at the pre sent time, whereas the result in 
( Eri ksson and Mehlia, 120041) was averaged over all four pos- 
sibilities (both individuals in the same or in different sub popu- 
lations). 

4. Results 

In this section we describe the results obtained when apply- 
ing the SMC algorithm to the models of population structure 
described in Fig. [3^-c. 
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Figure 5: The correlation p of the time to the most recent common ancestor for 
two loci separated by recombination rate R in a sample of two individuals for the 
population bottleneck model (see Fig.[3ji), for the SMC method (symbols) and 
the coalescent (colored lines), as a function of the population size x during the 
bottleneck, for three recombination rates; R = 0.5 (green line, squares), R = 3.5 
(blue line, circles), and R = 15 (red line, triangles). The other parameters are: 
T = 0.36, D = 0. 18, and y = 1. All data points are calculated from simulations 
of 10 6 independent realisations. 



4.1. Population bottlenecks 

In the population bottleneck model, the correlation p differs 
from the probability of linkage pi. Therefore we only study the 
former. Fig.[5]shows the correlation p of the time to the most re- 
cent common ancestor for two loci separated by recombination 
rate R in a sample of two individuals for the population bottle- 
neck model [see Fig.[3k]. The correlation p is shown as a func- 
tion of the population size x during the bottleneck. The symbols 
correspond to the SMC approximation, and the coloured lines 
to the coalescent algorithm. Results for three different recom- 
bination rates are shown: R = 0.5 (green line, squares), R = 3.5 
(blue line, circles), and R = 15 (red line, triangles). The other 
parameters are: T = 0.36, D = 0.18, andy = 1. 

Fig. |5]makes it clear that in the bottleneck model, the SMC 
approximation give very similar results to the coalescent. The 
largest deviations occur for relatively small widths of the bottle- 
neck (x « 0.1) and for intermediate values of the recombination 
rate R. 



4.2. Population divergence model 

Fig- shows the correlation p of the time to the most recent 
common ancestor for two loci separated by recombination rate 
R for a sample of two individuals in the population-divergence 
model. Shown are the SMC approximation (symbols) and for 
the standard coalescent (solid lines), as a function of the recom- 
bination rate R between the loci. 

The upper panel shows how the correlation decrease as a 
function of R, for two values of the duration D of the diver- 
gence. For D = 0, corresponding to the standard constant pop- 
ulation size, we confirm that the SMC approximation gives a 
correlation p(R) which is very similar to that of the coalescent 
(blue solid line and circles). For D = 10 (red solid line and 
triangles), however, SMC approxoimation differs significantly 
from the coalescent: the correlations decrease more rapidly in 
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Figure 6: Top: the correlation p of the time to the most recent common ancestor 
for two loci separated by recombination rate R for a sample of two individuals 
in the population-divergence model (Fig.[3p); SMC approximation (symbols) 
and coalescent (lines), both as a function of the recombination rate R. All data 
points are calculated from simulations of 10 5 independent realisations. The 
dashed lines represent the probability of linkage in the SMC model, Eq. (3). 
Two choices of the time D since the divergence are shown: D = (no di- 
vergence, blue lines and circles), and D = 10 (long divergence, red lines and 
triangles). Bottom: The correlation p of the time to the most recent common an- 
cestor of the two loci in the population-divergence model, for the SMC method 
(symbols) and the coalescent (lines), as a function of the duration D of the di- 
vergence. All data points are calculated from simulations of 10 5 independent 
realisations. Again, the dashed lines represent the probability of linkage in the 
SMC model. From top to bottom: R = 0.5, R = 1, and R = 2. 



the SMC approximation than in the coalescent. While the cor- 
relations in the coalescent decrease as a power law, they de- 
crease exponentially in the SMC approximation (as will be 
shown analytically below) when D > 0. The largest differences 
occur for intermediate values of the recombination rate R. 

A similar pattern can be seen in the lower panel, where the 
correlation p is shown as a function of D for different values of 
the recombination rate. The correlation is a decreasing function 
of D, but for large values of D the decrease is very slow (we 
show below that the correlations approach a constant level for 
large values of D). For small values of D the SMC approx- 
imation yields correlations p which are close to those of the 
coalescent. The accuracy of the SMC approximation declines 
with increasing values of D, but for large values of D the error 
approaches a constant. 

It is very difficult to obtain an analytical expression for the 
correlation p in the SMC approximation. However, a compari- 
son in Fig.|6]of the probability of linkage pi to the correlation p 
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in the SMC approximation shows that, for the parameter values 
investigated, both appear to be identical (to within stochastic 
fluctuations). 

In the following we show how to analytically calculate pi. 
We first derive a general relation between the single-locus dis- 
tribution function of the time r to the most recent common an- 
cestor, /(r), and the probability of linkage, pi. Let </>(r) be 
the probability that a line re-attaches to the same branch when 
started from a point uniformly on [0, f\. We refer to the prob- 
ability 0(t) as the 'loop probability'. Because the number of 
recombination events separating two loci is Poisson distributed, 
and since each recombination event leaves the genealogy intact 
with probability 0(r), we find that the probability of linkage is 



drf(r) YW)] 
irt 



f 

Jo 



dr /(t) e 



k=0 

-[i-mvtr 



k\ 



(4) 



Thus, in order to obtain the probability of linkage for a given 
population model, one needs to calculate the loop probability. 

In Appendix [A] we summarise this calculation for the pop- 
ulation divergence model (and, as a special case, for a freely 
mixing population of constant size). The result is: 



-R/2 



PL = 



f 

Jo 



ds s 



(R-2)/4 



exp 



R(l-2e- D )(l + s) 



(5) 



This result can be expressed in closed form in terms of the in- 
complete Gamma function (see Appendix|A]for details). In Ap- 
pendix |A. 2.1 I we derive a bound for this probability: 



8g -(3-2<r D )«/4 2 e - <l - e ~ D)R 
PL ~ R 2 + 8R+ 12 + R + 6 



(6) 



Eq. © shows that the linkage probability falls off exponentially 
with increasing R for all values of D > 0. Now consider the 
exact probability of linkage in the coalescent algorithm, Eq. Q. 
According to this expression, pi falls off at a rate ~ 4/R 2 when 
RD » 1, and not exponentially as in the SMC approximation. 

Note, however, that when D = (corresponding to a freely 
mixing population of constant size), the bound © for pi in 
the SMC approximation does not decrease exponentially but 
as 2/(R + 6) for large values of R. A lower bound for this case, 
Pl ^ 1/(1 + R), is derived in Appendix IA.2.21 and confirms 
that for D = the probability of linkage decreases inversely 
proportional to R. 

In summary, for the population divergence model we have 
found the following. For sufficiently large duration of the diver- 
gence (for a large value of D), there may be significant differ- 
ences in the correlation p between the coalescent and the SMC 
approximation. The main reason for this discrepancy is that in 
the coalescent, the ancestral lines may merge and break sev- 
eral times during the divergence, but in the SMC approxima- 
tion the probability this to happen is much lower. In addition, 
we have established numerically that in the SMC approxima- 
tion, the correlation p and the probability of linkage appear to 
be identical (this is known to strictly hold in the coalescent, 




Figure 7: Correlation p of the time to the most recent common ancestor for two 
loci separated by recombination rate R in a sample of two individuals for the 
two-island model (see Fig.fJ];), as function of R for M = 0.01 (green line and 
triangles), M = 0.2 (blue line and circles), and M = 10 (red line and diamonds). 
The lines corresponds to from simulations of the coalescent, and the symbols 
to simulations of the SMC model, from 10 4 independent realisations. 



c.f. the discussion in the methods section). Last but not least, 
our analytical results within SMC approximation show that the 
probability of linkage decreases exponentially with increasing 
recombination rate, in contrast the exact coalescent where the 
probability of linkage decreases as a power law in R. 

4.3. Two-island model 

In Fig. [7] we show the correlation p of the time to the most 
recent common ancestor for two loci separated by recombina- 
tion rate R in a sample of two individuals for the two-island 
model (Fig. |3j:), as function of R for three different migration 
rates: M = 0.01 (green line and triangles), M = 0.2 (blue line 
and circles), and M — 10 (red line and diamonds). The results 
are based on 10 4 simulations of the standard two-island coales- 
cent with migration, and the two-island extension of the SMC 
method described in Section l4~3l 

When the migration rate is large, the SMC approximation 
gives results very similar to those of the coalescent. This is ex- 
pected since in this case the population behaves as a single pan- 
mictic unit. For intermediate and small migration rates, how- 
ever, differences are observed. These differences increase with 
decreasing migration rates. 

5. Discussion 

In this article we have investigated the accuracy of the SMC 
approximation to the coalescent for the population models de- 
picted in Fig. [3^-c. This is an important question since many 
populations show deviations in their allelic distributions con- 
sistent with historic population expansions, bottlenecks, and 
structure from geographic separation or preferential mating. To 
determine the accuracy we have computed the correlation of 
the time to the most recent common ancestor for two loci in 
a sample of two individuals in the SMC approximation. We 
have compared these results to the corresponding exact coales- 
cent results. In this section we briefly discuss the results for the 
three different models, and put them in a wider context. 
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In the population-bottleneck model (Fig.[3h) we have found 
that the SMC approximation works well. This result can be un- 
derstood more generally as follows. Consider a general model 
of time-dependent population-size variations, where N(t) is the 
population size as a function of time. The coalescent for this 
model can be mapped to the constant population-size model by 
introducing a (possibly nonlinear) transformation of the time r 
to a 'stretched' time 

r'= f dtN(0)/N(t). (7) 
Jo 

Because of the construction of the SMC algorithm, the coa- 
lescent and the corresponding SMC approximation are trans- 
formed in the same way by this transformation. Hence, it may 
be expected that the SMC approximation works well for ar- 
bitrary models of population expansions and bottlenecks. We 
have verified this expectation for the particular case of the bot- 
tleneck model depicted in Fig. [3^. 

In the population-divergence model (Fig.[3}5) the situation is 
different. We have calculated the linkage probability pi (which 
is equal to p in this model) both exactly and within the SMC 
approximation of the coalescent. Our analytical results show 
that when the expected number of recombination events during 
the divergence becomes large, the accuracy of the SMC ap- 
proximation deteriorates: the exact correlation decreases as a 
power law as a function of R, whereas in the SMC approxima- 
tion it decreases exponentially. 

In order to better understand this difference, let us compare 
how the divergence affects the gene genealogies in the coales- 
cent to how it affects gene genealogies in the SMC. In the coa- 
lescent, recombination can cause the genetic material to spread 
out over different ancestral lines during the divergence. The 
probability of linkage depends on whether the genetical mate- 
rial is spread over two, three or four lines at the onset of the 
divergence. In each case, however, the coalescent results in a 
probability of linkage which decreases as a power law with in- 
creasing R. 

In the SMC approximation, by contrast, the probability of 
linkage is a function of the loop probability. As explained in 
Section l4~2l the cause of the exponential decrease of the proba- 
bility of linkage in the SMC approximation can be traced back 
to how the loop probability depends on the duration of the di- 
vergence. Hence, the different behaviours of the coalescent and 
the SMC approximation can be attributed to the lack of mem- 
ory of past gene genealogies within the SMC approximation. 

Note also that if there is a severe population bottleneck dur- 
ing the divergence, we expect that the differences between the 
coalescent and the SMC will be smaller, since the bottleneck 
increases the chances that the genes will be linked right before 
the divergence. This means that as far as the correlation of times 
to the most recent common ancestor is concerned, bottlenecks 
right after the divergence have the same effect as decreasing the 
time D to the divergence. 

The third model we have considered is the two-island model 
with migration (Fig. [3};). The extension of the SMC approxi- 
mation to this model turned out to be more complicated than in 
the other two models, since it is necessary to keep track of not 



only the time to the most recent common ancestor in the geneal- 
ogy, but also of the sequence of migration events. Simulations 
of the exact coalescent and the SMC approximation showed 
that when the migration rate is large, the SMC approximation 
is accurate. For small and intermediate values of the migra- 
tion rate, by contrast, significant differences were found. The 
magnitude of the relative deviations between the exact result 
and the SMC approximation is similar to those in the popu- 
lation divergence model. Numerical analysis shows that when 
the recombination rate R between the two loci is sufficiently 
large, p decreases exponentially with increasing R in the SMC, 
whereas in the coalescent it is approximately inversely propor- 
tional to R in this regime. 

It follows from our results that in a more complex scenario 
such as that depicted in Fig.[3ji, one may expect the SMC ap- 
proximation to work well when individuals are sampled from 
the same sub population. When the individuals are sampled 
from different geographical locations, however, or more gen- 
erally when one may expect reduced gene flow between the 
different sample locations, the accuracy of the SMC needs to 
be carefully investigated. We have found that the accuracy de- 
pends upon the amount of gene flow within and between sub 
populations, on how recently divergences may have occurred, 
and upon the genetic distance between the loci in question. 
These dependencies are summarised in Figs. 5, 6, and 7. 
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A. Probability of linkage 

In this appendix we summarise how to calculate the probabil- 
ity of linkage in the SMC approximation, for a freely mixing 
population of constant size, and for the population-divergence 
model. We begin by discussing the constant population-size 
model, and then turn to the population divergence. 

A.l. Constant population size 

Consider a freely mixing population of constant size. To de- 
termine the linkage probability of two loci in a sample of size 
two, the first step is to calculate the loop probability <p(f), the 
probability that the detached line re-attaches to the line it orig- 
inated from before the two lines in the genealogy coalesce at 
time t (upper right panel of Fig. 2). Suppose that the recom- 
bination point occurs at time t (with < t < r). Since there 
are two lines to attach to, the time to attachment is exponen- 
tially distributed with rate two. Given that the line attaches be- 
fore time r, it attaches to the original line with probability 1 /2. 
Hence, the probability of a loop occuring from recombination 
point at time t is 



if- 



In SMC approximation, the recombination point t is chosen 
from a uniform distribution over the gene genealogy. Hence, 
the probability that a recombination event re-attaches to the 
same line and thus leaves the gene genealogy intact (this is the 
loop probability) is found by averaging over all recombination 
points t e [0,t]: 



(t)= - r dt- r dt'2e- i(,, - ,) 

t Jo 2 J, 



2t- 1 + 



„-2r 



4t 



(8) 



In the second step, we use the loop probability ([8]) in combina- 
tion with the distribution of the time to the most recent common 
ancestor for a single locus to calculate the probability of link- 
age. In a freely mixing population of constant size, the time to 
the most recent common ancestor is exponentially distributed 
with unit mean, 



m = e~ 



(9) 
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Inserting Eqs. (O and (0 into the general formula, Eq. re- 
sults in: 



a?Texp[-T-(2T+ 1 -« 

2 Jo 



- 2t )ria\ 



= 2 r/2 e -R/4 ( _ Rr (R-2)/* 7 ( (R + 2)1 A, -R/4) . (10) 

Here y is the so-called 'lower incomplete Gamma function' de- 
fined as y(a, b) = j Q df^'V. Eq. ((TUJ is our result for the 
probability of linkage for two loci in a sample of two individu- 
als for a freely mixing population of constant size in the SMC 
approximation. 

A.2. Population divergence model 

For the case of a population divergence, the calculation is 
complicated by the difference in population structure during 
different stages of the history. We assume that the two indi- 
viduals are sampled from different sub-populations, so that the 
most recent common ancestor occurred before the divergence. 
The time to the most recent common ancestor for the left-most 
locus has a shifted exponential distribution: 

ft ^ / eHT ~ D> forT>D nu 

f(j) = { n „.u„ m , ; „„ (11) 







If recombination and re-attachment occur during the divergence 
the result must be a loop, because the line corresponding to the 
other individual resides in the other sub-population. If recom- 
bination or re-attachment do not occur during the divergence, 
the re-attachment process is the same as in Sec. A. 1 : the line 
attaches with rate 2 until the most recent common ancestor is 
reached. 

Let tff(t, t) be the probability of a loop occuring, where t is the 
time of the recombination event and t is the time of the most 
recent common ancestor. First, consider the case of t < D, 
During the interval [t, D], the line re-attaches to the ancestral 
recombination graph with rate one, and if it re-attaches it is 
to the original line it detached from. Within the time interval 
[D, t], the line re-attaches with rate 2, and the probability that 
the re-attachment point is on the original line is 1 /2. This gives 
the contribution 



>J/(t, r) = 1 - e 



-iP-t) 
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+ -< 
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-2(t-D)1 
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to the loop probability. When the recombination point is before 
the divergence (D < t < t), we obtain 



,-2(t-01 



(13) 



using the same arguments as for t < D. Since the starting point 
is chosen uniformly from the gene genealogy, the loop proba- 
bility <p(r) is 



(T) 



If. 

T Jo 



dt iff(t, t) 

2(D + t) - 3 + 2e-° - (l - 2e-°) e 

4t 



2(t-D) 



(14) 



We insert Eqs. (fTTT) and (TBI) into the general expression for 
the loop probability, Eq. ©. Further, we perform a change of 
integration variable from t to s — exp[-2(r - D)]. This leads to 
Eq. ©. 

A.2.1. Upper bound for p i 

We now show how the bound ^ was obtained. We make use 
of the fact that the exponential function is convex and obtain the 
following inequality valid for < s < 1 , with equality only at 
the end points: 



-/f(l-2e- D )(l+j)/4 



< e 



-R(l-2e- D )/4 



(!-*)+< 



-R(l-2e- D )/2 t 



(15) 



Inserting this expression into Eq. (O results in Eq. ©. Plot- 
ting the right-hand side and comparing to the exact integral in 
Eq. (0 shows that the inequality Eq. © is very close to being 
an equality for the parameters used in Fig. [6] 

A.2.2. Lower bound for pi at D = 

For D = 0, the upper bound does not decrease exponentially 
with R, but in order to say that pi does not decrease exponen- 
tially we need a lower bound. We can obtain one such by writ- 
ing ([Tol l as 



PL 
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Using the inequality e at > (1 - tf, which is valid for all a > 
and < t < 1 , we obtain 



PL 
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dt( \ - f) 



(R-l)/2 
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(17) 



which we recognize as the probability of linkage in the SMC 
model. From this bound it is clear that pi decreases as l/R for 
large values of R, rather than decreasing exponentially as the 
exact result does, compare Eqs. © and ©. 
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